Predictive policing models have become a controversial manner in which police departments across the country allocate their resources. The thinking behind predictive policing models is to help police departments strategically allocate resources to better “fight crime,” by forecasting the location of crimes, the potential perpetrators, responses to crimes and investigative outcomes. However, these models and methods are riddled with bias, which has furthered inequity in policing, thus leading some cities to ban predictive policing.
We aim to develop a model to predict risk of criminal damage to vehicles in Chicago, by developing a predictive risk model using data from the Chicago Police Department, Chicago’s Open Data website, and census data. All features are chosen because of their relation to risk of criminal damage to a vehicle. Usually, to predict crime the Broke Window Theory is utilized, which ultimately uses indicators of poverty to predict crime. The Broken Windows Theory uses features of the built environment, such as blight, abandoned cars, broken windows, etc. to indicate “disorder” or “a tolerance for criminality.” Yet, by using these types of features, we have criminalized poverty, which has been done throughout history, and is an especially tense issue in Chicago where low-income neighborhoods bear the immense and unfair societal costs of this practice.
We hypothesize that risk of criminal damage to a vehicle is a function of exposure to certain geospatial risk and protective factors, which features used will be discussed later on in the report. However, our feature choices are subject to biases, as are all models. One type of bias is selection bias. For example, the crime data is of reported criminal damage to vehicles and there is a difference between reported crimes and actual crimes. The reasons this happens could be attributed to different groups of people are more likely to report crimes compared to other groups. Those that trust the police may report crimes a greater rate than those that have a deep mistrust of the police. Further, police officers are not stationed randomly across the city. There is the issue of selective enforcement, which could be based on an officers beliefs or preconceived notions of where crime is worse in a city. We also know that neighborhood with more non-White residents are policed more.
The aim of this report is to gain a better understanding of predictive policing models to make a recommendation on if they should be utilized by public entities. We use criminal damage to vehicles in Chicago as the exposure.
## Read in Data from Chicago
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
## Reading layer `OGRGeoJSON' from data source
## `https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON'
## using driver `GeoJSON'
## Simple feature collection with 25 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## Geodetic CRS: WGS 84
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = beat_num)
## Reading layer `OGRGeoJSON' from data source
## `https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON'
## using driver `GeoJSON'
## Simple feature collection with 277 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## Geodetic CRS: WGS 84
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"),
mutate(policeBeats, Legend = "Police Beats"))
#Choosing the crime of criminal damage to vehicle
criminal.damage <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>%
filter(Primary.Type == "CRIMINAL DAMAGE" & Description == "TO VEHICLE") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform('ESRI:102271') %>%
distinct()
chicagoBoundary <-
st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
st_transform('ESRI:102271')
## Reading layer `chicagoBoundary' from data source
## `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA///Chapter5/chicagoBoundary.geojson'
## using driver `GeoJSON'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS: WGS 84
The map on the left shows points of criminal damage to vehicles in Chicago, while the map on the right shows the density of criminal damage to a vehicle within the city. There is a high density of damage to vehicle north of the loop area in adjacent areas of Lake View, Lincoln Park, and Logan Square, with an extremely high concentration in the West Side of Chicago, and medium density in the Southwest and South side.
# uses grid.arrange to organize indpendent plots
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = criminal.damage, colour="red", size=0.1, show.legend = "point") +
labs(title= "Criminal Damage to Vehicles, Chicago 2017") +
mapTheme(title_size = 12),
ggplot() +
geom_sf(data = chicagoBoundary, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(criminal.damage)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of Criminal Damage to Vehicles, Chicago 2017") +
mapTheme(title_size = 12) + theme(legend.position = "none"))
##Creating a Fishnet Grid
## Creating a fishnet grid
#What is a fishnet grid?
#The `{sf}` package offers really easy way to create fishnet grids.
#Examine the fishnet - the unique ID is crucial to building a data set!
## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <-
st_make_grid(chicagoBoundary,
cellsize = 500,
square = TRUE) %>%
.[chicagoBoundary] %>% # <- MDH Added
st_sf() %>%
mutate(uniqueID = rownames(.))
#plot(fishnet)
Instead of looking at crime per police district, or police beat, which is a smaller administrative area, we need to aggregate point level crime data into a fishnet grid. By aggregating our crime data to a fishnet grid, it allow us to visualize crime risk across the city rather than dis-aggregating it by administrative area.
The fishnet map below shows hot spots in areas that have a high number of reported criminal damage to vehicles by grid square (defined as 500 feet by 500 feet). The squares in yellow indicate a higher count of reported vehicle damage. While there is not a significant amount of yellow, the lighter colors are in the same areas displayed on the density map. This map shows the clustered spatial process of criminal damange to vehicles in Chicago.
## add a value of 1 to each crime, sum them with aggregate
crime_net <-
dplyr::select(criminal.damage) %>%
mutate(countCriminal = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countBurglaries = replace_na(countCriminal, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = crime_net, aes(fill = countCriminal), color = NA) +
scale_fill_viridis() +
labs(title = "Count of Criminal Damage to Vehicles for the fishnet") +
mapTheme()
# For demo. requires updated mapview package
# xx <- mapview::mapview(crime_net, zcol = "countBurglaries")
# yy <- mapview::mapview(mutate(burglaries, ID = seq(1:n())))
# xx + yy
After visualizing the density and distribution of criminal damage to vehicles across Chicago, we add spatial features to better understand the areas that have a high density of criminal damage to vehicles. We hypothesize that areas that are newly gentrified have more criminal damage to vehicles. This could be due to reporting bias, newly gentrified areas may have people reporting damage, it could also be with the influx of new residents more are car owners. With a shift in residents comes a changing demographic and economic profile for those neighborhoods.
We use the spatial features of playground and grocery stores for areas that are experiencing or have recently experienced gentrification. The rational is that higher income areas tend to have well maintained amenities such as playgrounds and the establishment of large grocery stores. The other spatial features used, graffiti remediation, 311 reports of abandoned cars, sanitation complaints, street lights that are out, and liquor stores are features used in the controversial broken window theory. However, the broken window theory is a highly problematic theoretical basis to determine risk or place value on an area.
#Modeling spatial features and wrangling risk factors
abandonCars <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
graffiti <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street_Lights_Out")
liquorRetail <-
read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%
filter(business_activity == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
groceryStores <-
read.socrata("https://data.cityofchicago.org/resource/ce29-twzt.json") %>%
#filter(store_name == "JEWEL FOOD STORE") %>%
#str_detect(XF1,"Pool"
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Grocery_Stores")
playgrounds <-
read.socrata("https://data.cityofchicago.org/resource/eix4-gf83.json") %>%
filter(facility_n == "PLAYGROUND") %>%
dplyr::select(Y = y_coord, X = x_coord) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Playground")
## Neighborhoods to use in LOOCV
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
## Reading layer `chicago' from data source
## `https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson'
## using driver `GeoJSON'
## Simple feature collection with 98 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS: WGS 84
vars_net <- rbind(abandonCars, streetLightsOut, liquorRetail, graffiti, sanitation, playgrounds, groceryStores) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet, by = "uniqueID") %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
## `summarise()` has grouped output by 'uniqueID'. You can override using the `.groups` argument.
# vars_net <- abandonCars %>%
# spatially join abandonCars points to the fishnet polygon they are within %>%
# drop the geometry attribute %>%
# group_by each cells ID and the name of the feature %>%
# summarize count the number of each point per grid cell %>%
# join that summary back to spatial fishnet by cell ID %>%
# "spread" from long to wide format and make column of our point count %>%
# tell R that this should be an sf object %>%
# remove a fussy column that appears b/c of NA %>%
# get rid of rows with an NA in any column %>%
# remove grouping so you are not tripped up later
vars_net.long <- gather(vars_net, Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =2, top = "Risk Factors by Fishnet"))
The risk factors visualized above are clustered in different areas of Chicago. Street lights, playgrounds, and 311 reports of abandoned cars are more dispersed throughout the city rather than clustered in specific neighborhoods. Liquor retail is clustered around the Loop.
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
# convinience to reduce length of function names.
st_c <- st_coordinates
st_coid <- st_centroid
vars_net <-
vars_net %>%
mutate(
Abandoned_Cars.nn =
nn_function(st_c(st_coid(vars_net)),
st_c(abandonCars),5),
Graffiti.nn =
nn_function(st_c(st_coid(vars_net)),
st_c(graffiti),5),
Liquor_Retail.nn =
nn_function(st_c(st_coid(vars_net)),
st_c(liquorRetail),5),
Street_Lights_Out.nn =
nn_function(st_c(st_coid(vars_net)),
st_c(streetLightsOut),5),
Sanitation.nn =
nn_function(st_c(st_coid(vars_net)),
st_c(sanitation),5),
Playground.nn =
nn_function(st_c(st_coid(vars_net)),
st_c(playgrounds),5),
Grocery_store.nn =
nn_function(st_c(st_coid(vars_net)),
st_c(groceryStores),5),)
vars_net.long.nn <-
dplyr::select(vars_net, ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol = 2, top = "Nearest Neighbor risk Factors by Fishnet"))
The above maps plot the distance to the k nearest neighbor risk factors in the Chicago fishnet grid. For our feature engineering we set k = 3 and use the average nearest neighbor distance to our risk factors. We use K nearest neighbors to test our hypothesis of a “smoother” exposure to risk relationship across space.
View(vars_net)
View(neighborhoods)
##Create Final Fishnet
#measuring distance to single point, centroid of Loop the central business district
loopPoint <-
filter(neighborhoods, name == "Loop") %>%
st_centroid()
vars_net$loopDistance =
st_distance(st_centroid(vars_net),loopPoint) %>%
as.numeric()
## important to drop the geometry from joining features
final_net <-
left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
final_net <-
st_centroid(final_net) %>% #take the centroid of the fishnet
st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>% #spatially join those withinthe nhood polygons nd polict districs. ie assign the neighborhood to the fishnet id for whichever nhood the fishnet centroid falls into
st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>% #get the fishnet back in to get the polygons. drop the geom to do the left join and then being it back in
st_sf() %>%
na.omit()
## Joining, by = "uniqueID"
# for live demo
mapview::mapview(final_net, zcol = "District")
For our analysis, we need to test for spatial autocorrelation, or the clustering of our variables and the model’s error measures. Using the Local Moran’s I, our null hypothesis is that criminal damage to vehicles is randomly distributed relative to its immediate neighbors, which is why we use the Queen’s weight matrix in our analysis.
##Mendez C. (2020). Spatial autocorrelation analysis in R. R Studio/RPubs. Available at <https://rpubs.com/quarcs-lab/spatial-autocorrelation>
## generates warnings from PROJ issues
## {spdep} to make polygon to neighborhoods...
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
print(final_net.weights, zero.policy=TRUE)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 1844
## Number of nonzero links: 12824
## Percentage nonzero weights: 0.3771392
## Average number of links: 6.954447
## 5 regions with no links:
## 140 142 280 2203 2441
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 1839 3381921 1839 575.1293 7407.166
local_morans <- localmoran(final_net$countCriminal, final_net.weights, zero.policy=TRUE) %>%
as.data.frame()
final_net.localMorans <-
cbind(local_morans, as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(Criminal_Count = countCriminal,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
From the results of our Local Moran’s I test, we reject the null hypothesis that criminal damage to a vehicle is randomly distributed relative to it’s immediate neighbors. Our Local Moran’s I plot below shows that there is a higher Local Moran’s I in places indicated by yellow, these high values show strong and statistically significant evidence of local clustering. Significance is determined by a Local Moran’s I test of greater than 5. The statistically significant hotspots are located around the Loop, the West side and close to Hyde Park.
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme() + theme(legend.position="bottom")}
do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics; Criminal Damage to Vehicles in Chicago"))
The next part of the analysis creates a variable that is the distance from each fishnet grid cell centroid point to the nearest significant cluster or hotpost. After creating this new variable, we can look at local information on the spatial process of criminal damage to vehicles.
final_net <-
final_net %>%
mutate(criminal.isSig =
ifelse(localmoran(final_net$countCriminal,
final_net.weights, zero.policy = TRUE)[,5] <= 0.0000001, 1, 0)) %>%
mutate(criminal.isSig.dist =
nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, criminal.isSig == 1))), 1))
### Plot NN distance to hot spot
ggplot() +
geom_sf(data = final_net, aes(fill=criminal.isSig.dist), colour=NA) +
scale_fill_viridis(name="NN Distance") +
labs(title="Distance to Highly Significant Criminal Damage to Vehcile Hotspots") +
mapTheme()
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -name, -District,-loopDistance, -countBurglaries) %>%
gather(Variable, Value, -countCriminal)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countCriminal, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countCriminal)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Criminal Damage to Vehicle count as a function of risk factors") +
plotTheme()
## `geom_smooth()` using formula 'y ~ x'
The correlation plots allow us to get a glimpse of which features may be useful in predicting criminal damage to vehicles in Chicago. The plots show each indicator plotted against count and nearest neighbors of criminal damage to vehicles. For our model, we will select either the count or nearest neighbors (not both, to avoid colinearity) each risk factor.
##Poission regression Now, we create two sets of independent indicators: Spatial Process and Just Risk Factors. Spatial Process contains the same features as Just Risk Factors with an additional local spatial feature. The sets are grouped by feature count or feature nearest neighbor. For our model, we will use a Cross-Validated Poisson Regression, based on the Poisson distribution curve. The models simulated distribution is shown in the histogram below and follows a Poisson curve.
# View(crossValidate)
## define the variables we want, creating two new independent variables data sets
reg.vars <- c( "Abandoned_Cars.nn", "Graffiti.nn", "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn","loopDistance","Grocery_store.nn", "Playground.nn")
reg.ss.vars <- c("Abandoned_Cars.nn", "Graffiti.nn", "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn","loopDistance","Grocery_store.nn", "Playground.nn", "criminal.isSig", "criminal.isSig.dist")
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(countCriminal ~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
hist(final_net$countCriminal)
##Cross Validation We need to test our models performance on new data and on different neighborhoods. Now, we need to cross validate the four regressions. Two will use Leave-on-group-out cross validation (LOGO-CV) and two will use k-fold cross validation. LOGO-CV will be based on name of neighborhood. Both will use risk factors and distance to hotspot. The result of each cross validation is a dataset with observed and predicted burglary counts spatialized.
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countCriminal",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countCriminal, Prediction, geometry)
## This hold out fold is 62
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(id)` instead of `id` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(indVariables)` instead of `indVariables` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(dependentVariable)` instead of `dependentVariable` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## This hold out fold is 9
## This hold out fold is 45
## This hold out fold is 41
## This hold out fold is 34
## This hold out fold is 56
## This hold out fold is 24
## This hold out fold is 21
## This hold out fold is 40
## This hold out fold is 54
## This hold out fold is 74
## This hold out fold is 69
## This hold out fold is 64
## This hold out fold is 13
## This hold out fold is 46
## This hold out fold is 32
## This hold out fold is 6
## This hold out fold is 90
## This hold out fold is 66
## This hold out fold is 18
## This hold out fold is 58
## This hold out fold is 3
## This hold out fold is 30
## This hold out fold is 86
## This hold out fold is 15
## This hold out fold is 38
## This hold out fold is 57
## This hold out fold is 39
## This hold out fold is 43
## This hold out fold is 44
## This hold out fold is 10
## This hold out fold is 1
## This hold out fold is 70
## This hold out fold is 98
## This hold out fold is 37
## This hold out fold is 80
## This hold out fold is 14
## This hold out fold is 63
## This hold out fold is 94
## This hold out fold is 11
## This hold out fold is 101
## This hold out fold is 75
## This hold out fold is 31
## This hold out fold is 19
## This hold out fold is 25
## This hold out fold is 59
## This hold out fold is 99
## This hold out fold is 85
## This hold out fold is 83
## This hold out fold is 35
## This hold out fold is 79
## This hold out fold is 81
## This hold out fold is 26
## This hold out fold is 23
## This hold out fold is 48
## This hold out fold is 96
## This hold out fold is 76
## This hold out fold is 36
## This hold out fold is 82
## This hold out fold is 29
## This hold out fold is 22
## This hold out fold is 51
## This hold out fold is 61
## This hold out fold is 91
## This hold out fold is 102
## This hold out fold is 53
## This hold out fold is 72
## This hold out fold is 78
## This hold out fold is 84
## This hold out fold is 8
## This hold out fold is 17
## This hold out fold is 7
## This hold out fold is 16
## This hold out fold is 92
## This hold out fold is 33
## This hold out fold is 47
## This hold out fold is 89
## This hold out fold is 97
## This hold out fold is 55
## This hold out fold is 87
## This hold out fold is 5
## This hold out fold is 95
## This hold out fold is 93
## This hold out fold is 65
## This hold out fold is 88
## This hold out fold is 27
## This hold out fold is 68
## This hold out fold is 42
## This hold out fold is 73
## This hold out fold is 50
## This hold out fold is 60
## This hold out fold is 28
## This hold out fold is 71
## This hold out fold is 12
## This hold out fold is 77
## This hold out fold is 4
## This hold out fold is 2
## This hold out fold is 67
## This hold out fold is 49
## This hold out fold is 20
## This hold out fold is 52
## This hold out fold is 100
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countCriminal",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countCriminal, Prediction, geometry)
## This hold out fold is 62
## This hold out fold is 9
## This hold out fold is 45
## This hold out fold is 41
## This hold out fold is 34
## This hold out fold is 56
## This hold out fold is 24
## This hold out fold is 21
## This hold out fold is 40
## This hold out fold is 54
## This hold out fold is 74
## This hold out fold is 69
## This hold out fold is 64
## This hold out fold is 13
## This hold out fold is 46
## This hold out fold is 32
## This hold out fold is 6
## This hold out fold is 90
## This hold out fold is 66
## This hold out fold is 18
## This hold out fold is 58
## This hold out fold is 3
## This hold out fold is 30
## This hold out fold is 86
## This hold out fold is 15
## This hold out fold is 38
## This hold out fold is 57
## This hold out fold is 39
## This hold out fold is 43
## This hold out fold is 44
## This hold out fold is 10
## This hold out fold is 1
## This hold out fold is 70
## This hold out fold is 98
## This hold out fold is 37
## This hold out fold is 80
## This hold out fold is 14
## This hold out fold is 63
## This hold out fold is 94
## This hold out fold is 11
## This hold out fold is 101
## This hold out fold is 75
## This hold out fold is 31
## This hold out fold is 19
## This hold out fold is 25
## This hold out fold is 59
## This hold out fold is 99
## This hold out fold is 85
## This hold out fold is 83
## This hold out fold is 35
## This hold out fold is 79
## This hold out fold is 81
## This hold out fold is 26
## This hold out fold is 23
## This hold out fold is 48
## This hold out fold is 96
## This hold out fold is 76
## This hold out fold is 36
## This hold out fold is 82
## This hold out fold is 29
## This hold out fold is 22
## This hold out fold is 51
## This hold out fold is 61
## This hold out fold is 91
## This hold out fold is 102
## This hold out fold is 53
## This hold out fold is 72
## This hold out fold is 78
## This hold out fold is 84
## This hold out fold is 8
## This hold out fold is 17
## This hold out fold is 7
## This hold out fold is 16
## This hold out fold is 92
## This hold out fold is 33
## This hold out fold is 47
## This hold out fold is 89
## This hold out fold is 97
## This hold out fold is 55
## This hold out fold is 87
## This hold out fold is 5
## This hold out fold is 95
## This hold out fold is 93
## This hold out fold is 65
## This hold out fold is 88
## This hold out fold is 27
## This hold out fold is 68
## This hold out fold is 42
## This hold out fold is 73
## This hold out fold is 50
## This hold out fold is 60
## This hold out fold is 28
## This hold out fold is 71
## This hold out fold is 12
## This hold out fold is 77
## This hold out fold is 4
## This hold out fold is 2
## This hold out fold is 67
## This hold out fold is 49
## This hold out fold is 20
## This hold out fold is 52
## This hold out fold is 100
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countCriminal",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countCriminal, Prediction, geometry)
## This hold out fold is Riverdale
## This hold out fold is Hegewisch
## This hold out fold is West Pullman
## This hold out fold is South Deering
## This hold out fold is Morgan Park
## This hold out fold is Mount Greenwood
## This hold out fold is Roseland
## This hold out fold is Pullman
## This hold out fold is East Side
## This hold out fold is Beverly
## This hold out fold is Washington Heights
## This hold out fold is Burnside
## This hold out fold is Calumet Heights
## This hold out fold is Chatham
## This hold out fold is South Chicago
## This hold out fold is Auburn Gresham
## This hold out fold is Ashburn
## This hold out fold is Avalon Park
## This hold out fold is West Lawn
## This hold out fold is Grand Crossing
## This hold out fold is South Shore
## This hold out fold is Chicago Lawn
## This hold out fold is Englewood
## This hold out fold is Woodlawn
## This hold out fold is Clearing
## This hold out fold is Jackson Park
## This hold out fold is Washington Park
## This hold out fold is Garfield Ridge
## This hold out fold is West Elsdon
## This hold out fold is Gage Park
## This hold out fold is Hyde Park
## This hold out fold is New City
## This hold out fold is Fuller Park
## This hold out fold is Archer Heights
## This hold out fold is Brighton Park
## This hold out fold is Grand Boulevard
## This hold out fold is Kenwood
## This hold out fold is Oakland
## This hold out fold is Little Village
## This hold out fold is Mckinley Park
## This hold out fold is Bridgeport
## This hold out fold is Armour Square
## This hold out fold is Douglas
## This hold out fold is Lower West Side
## This hold out fold is North Lawndale
## This hold out fold is Chinatown
## This hold out fold is Near South Side
## This hold out fold is Museum Campus
## This hold out fold is Little Italy, UIC
## This hold out fold is West Loop
## This hold out fold is Austin
## This hold out fold is Printers Row
## This hold out fold is Garfield Park
## This hold out fold is Grant Park
## This hold out fold is United Center
## This hold out fold is Greektown
## This hold out fold is Loop
## This hold out fold is Millenium Park
## This hold out fold is Humboldt Park
## This hold out fold is West Town
## This hold out fold is River North
## This hold out fold is Streeterville
## This hold out fold is Ukrainian Village
## This hold out fold is East Village
## This hold out fold is Rush & Division
## This hold out fold is Wicker Park
## This hold out fold is Gold Coast
## This hold out fold is Old Town
## This hold out fold is Galewood
## This hold out fold is Lincoln Park
## This hold out fold is Belmont Cragin
## This hold out fold is Hermosa
## This hold out fold is Logan Square
## This hold out fold is Bucktown
## This hold out fold is Montclare
## This hold out fold is Sheffield & DePaul
## This hold out fold is Dunning
## This hold out fold is Avondale
## This hold out fold is North Center
## This hold out fold is Lake View
## This hold out fold is Portage Park
## This hold out fold is Irving Park
## This hold out fold is Boystown
## This hold out fold is Wrigleyville
## This hold out fold is Uptown
## This hold out fold is Albany Park
## This hold out fold is Lincoln Square
## This hold out fold is Norwood Park
## This hold out fold is Jefferson Park
## This hold out fold is Sauganash,Forest Glen
## This hold out fold is North Park
## This hold out fold is Andersonville
## This hold out fold is Edgewater
## This hold out fold is West Ridge
## This hold out fold is Edison Park
## This hold out fold is Rogers Park
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countCriminal",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countCriminal, Prediction, geometry)
## This hold out fold is Riverdale
## This hold out fold is Hegewisch
## This hold out fold is West Pullman
## This hold out fold is South Deering
## This hold out fold is Morgan Park
## This hold out fold is Mount Greenwood
## This hold out fold is Roseland
## This hold out fold is Pullman
## This hold out fold is East Side
## This hold out fold is Beverly
## This hold out fold is Washington Heights
## This hold out fold is Burnside
## This hold out fold is Calumet Heights
## This hold out fold is Chatham
## This hold out fold is South Chicago
## This hold out fold is Auburn Gresham
## This hold out fold is Ashburn
## This hold out fold is Avalon Park
## This hold out fold is West Lawn
## This hold out fold is Grand Crossing
## This hold out fold is South Shore
## This hold out fold is Chicago Lawn
## This hold out fold is Englewood
## This hold out fold is Woodlawn
## This hold out fold is Clearing
## This hold out fold is Jackson Park
## This hold out fold is Washington Park
## This hold out fold is Garfield Ridge
## This hold out fold is West Elsdon
## This hold out fold is Gage Park
## This hold out fold is Hyde Park
## This hold out fold is New City
## This hold out fold is Fuller Park
## This hold out fold is Archer Heights
## This hold out fold is Brighton Park
## This hold out fold is Grand Boulevard
## This hold out fold is Kenwood
## This hold out fold is Oakland
## This hold out fold is Little Village
## This hold out fold is Mckinley Park
## This hold out fold is Bridgeport
## This hold out fold is Armour Square
## This hold out fold is Douglas
## This hold out fold is Lower West Side
## This hold out fold is North Lawndale
## This hold out fold is Chinatown
## This hold out fold is Near South Side
## This hold out fold is Museum Campus
## This hold out fold is Little Italy, UIC
## This hold out fold is West Loop
## This hold out fold is Austin
## This hold out fold is Printers Row
## This hold out fold is Garfield Park
## This hold out fold is Grant Park
## This hold out fold is United Center
## This hold out fold is Greektown
## This hold out fold is Loop
## This hold out fold is Millenium Park
## This hold out fold is Humboldt Park
## This hold out fold is West Town
## This hold out fold is River North
## This hold out fold is Streeterville
## This hold out fold is Ukrainian Village
## This hold out fold is East Village
## This hold out fold is Rush & Division
## This hold out fold is Wicker Park
## This hold out fold is Gold Coast
## This hold out fold is Old Town
## This hold out fold is Galewood
## This hold out fold is Lincoln Park
## This hold out fold is Belmont Cragin
## This hold out fold is Hermosa
## This hold out fold is Logan Square
## This hold out fold is Bucktown
## This hold out fold is Montclare
## This hold out fold is Sheffield & DePaul
## This hold out fold is Dunning
## This hold out fold is Avondale
## This hold out fold is North Center
## This hold out fold is Lake View
## This hold out fold is Portage Park
## This hold out fold is Irving Park
## This hold out fold is Boystown
## This hold out fold is Wrigleyville
## This hold out fold is Uptown
## This hold out fold is Albany Park
## This hold out fold is Lincoln Square
## This hold out fold is Norwood Park
## This hold out fold is Jefferson Park
## This hold out fold is Sauganash,Forest Glen
## This hold out fold is North Park
## This hold out fold is Andersonville
## This hold out fold is Edgewater
## This hold out fold is West Ridge
## This hold out fold is Edison Park
## This hold out fold is Rogers Park
#goodness of fit metrics - generalizability across space, binds together observed and predicted counts and errors for each grid cell and for each regression
reg.summary <-
rbind(
mutate(reg.cv, Error = Prediction - countCriminal,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = Prediction - countCriminal,
Regression = "Random k-fold CV: Spatial Process"),
mutate(reg.spatialCV, Error = Prediction - countCriminal,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = Prediction - countCriminal,
Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()
# calculate errors by NEIGHBORHOOD
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - countCriminal, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
## `summarise()` has grouped output by 'Regression'. You can override using the `.groups` argument.
## plot histogram of OOF (out of fold) errors
error_by_reg_and_fold %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
facet_wrap(~Regression) +
geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) +
labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
x="Mean Absolute Error", y="Count")
plotTheme()
## List of 18
## $ text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "black"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.title :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : num 12
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : num 10
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.ticks : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ legend.background: list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ legend.text :List of 11
## ..$ family : NULL
## ..$ face : chr "italic"
## ..$ colour : chr "black"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ legend.title :List of 11
## ..$ family : NULL
## ..$ face : chr "italic"
## ..$ colour : chr "black"
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ panel.background : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ panel.border :List of 5
## ..$ fill : logi NA
## ..$ colour : chr "black"
## ..$ size : num 2
## ..$ linetype : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_rect" "element"
## $ panel.grid.major :List of 6
## ..$ colour : chr "grey80"
## ..$ size : num 0.1
## ..$ linetype : NULL
## ..$ lineend : NULL
## ..$ arrow : logi FALSE
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_line" "element"
## $ panel.grid.minor : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ plot.background : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ plot.title :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "black"
## ..$ size : num 16
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ plot.subtitle :List of 11
## ..$ family : NULL
## ..$ face : chr "italic"
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ plot.caption :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 0
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ strip.background :List of 5
## ..$ fill : chr "grey80"
## ..$ colour : chr "white"
## ..$ size : NULL
## ..$ linetype : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_rect" "element"
## $ strip.text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : num 12
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ strip.text.x :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : num 14
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
The histograms above show the distribution of the Mean Absolute Error across each cross validation function and each independent feature set. The regressions that used LOGO-CV contain lower errors. The k-fold cross validated regressions are more clustered. When Just Risk Factors are used, lacking the spatial component, there are areas with greater than 15 errors. When the spatial feature in Spatial Process is incorporated the errors decrease. This means that there is a local experience of reporting criminal damage to vehicles across Chicago that needs to be incorporated into the predictive model, as it improves the model.
| Regression | Mean_MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Just Risk Factors | 0.80 | 0.66 |
| Random k-fold CV: Spatial Process | 0.79 | 0.65 |
| Spatial LOGO-CV: Just Risk Factors | 2.25 | 1.74 |
| Spatial LOGO-CV: Spatial Process | 2.22 | 1.72 |
mean(final_net$countCriminal)
## [1] 7.241323
The mean MAE is nearly the same in both regressions but slightly lower in the models including Spatial Process, indicating the importance of including spatial features. The mean observed criminal damage to vehicles is 7.2. There is relatively high error, which can be noted by comparing the mean error to the observed mean. The map below visualizes the LOGO-CV errors for both Just Risk Factors and Spatial Process. In both maps, the largest errors are in the same areas as hotspots.
error_by_reg_and_fold %>%
filter(str_detect(Regression, "LOGO")) %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Criminal Damage to Vehicle errors by LOGO-CV Regression") +
mapTheme() + theme(legend.position="bottom")
The graphs below show the difference in the observed and predicted risk. The graphs show that all models over predict in areas with low criminal damage to vehicles and under predict in hostpots. The areas with over prediction, which have a low observed criminal damage to vehicles, are areas of latent risk. The large difference, or under prediction, in areas with very high observed criminal damage to vehicle events could be attributed to the model having trouble predicting hotspots.
st_drop_geometry(reg.summary) %>%
group_by(Regression) %>%
mutate(Criminal_Decile = ntile(countCriminal, 10)) %>%
group_by(Regression, Criminal_Decile) %>%
summarize(meanObserved = mean(countCriminal, na.rm=T),
meanPrediction = mean(Prediction, na.rm=T)) %>%
gather(Variable, Value, -Regression, -Criminal_Decile) %>%
ggplot(aes(Criminal_Decile, Value, shape = Variable)) +
geom_point(size = 2) + geom_path(aes(group = Criminal_Decile), colour = "black") +
scale_shape_manual(values = c(2, 17)) +
facet_wrap(~Regression) + xlim(0,10) +
labs(title = "Predicted and observed criminal damage to vehicle by observed decile")
## `summarise()` has grouped output by 'Regression'. You can override using the `.groups` argument.
#Race Context
Next, we will test for generalizability across different neighborhood racial contexts. Using census data we will calculate the percent White and percent non-White for each census tract in Chicago.
census_api_key("717274061bb926a3200b42feacfac59e646398de", overwrite = TRUE)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
tracts18 <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2018, state=17, county=031, geometry=T) %>%
st_transform('ESRI:102271') %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
.[neighborhoods,]
## Getting data from the 2014-2018 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
##
|
| | 0%
|
|= | 1%
|
|== | 2%
|
|=== | 4%
|
|=== | 5%
|
|==== | 6%
|
|===== | 7%
|
|====== | 9%
|
|======= | 10%
|
|======== | 11%
|
|======== | 12%
|
|========== | 14%
|
|========== | 15%
|
|============ | 17%
|
|============= | 18%
|
|============== | 20%
|
|=============== | 21%
|
|================ | 23%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 28%
|
|===================== | 30%
|
|===================== | 31%
|
|======================= | 32%
|
|========================= | 35%
|
|========================== | 37%
|
|============================ | 40%
|
|============================= | 42%
|
|============================== | 43%
|
|=============================== | 45%
|
|================================ | 46%
|
|================================= | 47%
|
|================================== | 49%
|
|==================================== | 51%
|
|====================================== | 54%
|
|======================================= | 56%
|
|======================================== | 57%
|
|========================================= | 58%
|
|========================================== | 60%
|
|============================================ | 63%
|
|============================================= | 65%
|
|============================================== | 66%
|
|=============================================== | 68%
|
|================================================= | 70%
|
|=================================================== | 72%
|
|==================================================== | 74%
|
|====================================================== | 77%
|
|======================================================= | 79%
|
|========================================================= | 82%
|
|========================================================== | 83%
|
|============================================================ | 86%
|
|============================================================== | 88%
|
|================================================================ | 91%
|
|================================================================= | 93%
|
|=================================================================== | 96%
|
|==================================================================== | 97%
|
|======================================================================| 100%
reg.summary %>%
filter(str_detect(Regression, "LOGO")) %>%
st_centroid() %>%
st_join(tracts18) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
## `summarise()` has grouped output by 'Regression'. You can override using the `.groups` argument.
| Regression | Majority_Non_White | Majority_White |
|---|---|---|
| Spatial LOGO-CV: Just Risk Factors | -0.8692636 | 0.9921556 |
| Spatial LOGO-CV: Spatial Process | -0.9522174 | 0.9749106 |
The tables looks at the error, which is the predicted criminal damage minus the observed criminal damage. A positive value indicated over prediction. By including race, we can see if the model over predicts in majority Non-White neighborhoods, or ones that are over 50% non-White. The table shows the opposite, that the model over predicts in Majority White neighborhoods and under predicts in Majority Non-White neighborhoods. Reasons for this could be that the hotspots are located in….
Traditionally, Kernal Density models are used to predict hotspots for policing, so police can target those areas. We will explore the difference between Kernel Density modeling and our risk predictions. Kernal Density models used spatial features to predict based on density, which relies on nearby observations. This is problematic because it is a predictive tool based on spatial autocorrelation. For our analysis we set a 1000 foot radius.
We will run both models on criminal damage to vehicles from 2017 to predict the locations in 2018. This allows us to see how generalizable each model is across time. We can see below how our model performed compared to the Kernel Density model.
## Get 2018 crime data
criminal18 <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>%
filter(Primary.Type == "CRIMINAL DAMAGE" &
Description == "TO VEHICLE") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102271') %>%
distinct() %>%
.[fishnet,]
# demo of kernel width
criminal_ppp <- as.ppp(st_coordinates(criminal18), W = st_bbox(final_net))
criminal_KD.1000 <- spatstat.core::density.ppp(criminal_ppp, 1000)
#criminal_KD.1500 <- spatstat.core::density.ppp(criminal_ppp, 1500)
#criminal_KD.2000 <- spatstat.core::density.ppp(criminal_ppp, 2000)
criminal_KDE_sf <- as.data.frame(criminal_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(criminal18) %>% mutate(criminalCount = 1), ., sum) %>%
mutate(criminalCount = replace_na(criminalCount, 0))) %>%
dplyr::select(label, Risk_Category, criminalCount)
criminal_risk_sf <-
filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(criminal18) %>% mutate(criminalCount = 1), ., sum) %>%
mutate(criminalCount = replace_na(criminalCount, 0))) %>%
dplyr::select(label,Risk_Category, criminalCount)
rbind(criminal_KDE_sf, criminal_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(criminal18, 3000), size = .5, colour = "black") +
facet_wrap(~label, ) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="2017 Criminal Damage to Vehciles Risk Predictions; 2018 Criminal Damage") +
mapTheme()
These maps show the points of reported criminal damage to vehicles in 2018. Underneath is the 2017 risk category.The Kernel Density model has more points clustered around areas in a high risk category. A strong model would show highest risk category or yellow areas with the highest amount of points. From the maps it is difficult to tell if it is a strong model, we need to calculate the rate of 2018 criminal damage points by risk category and type of model.
rbind(criminal_KDE_sf, criminal_risk_sf) %>%
st_set_geometry(NULL) %>% na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countCriminal = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = countCriminal / sum(countCriminal)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Risk prediction vs. Kernel density, 2018 Criminal Damage to Vehicles") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
## `summarise()` has grouped output by 'label'. You can override using the `.groups` argument.
Above is a bar chart of both models by risk category. The risk prediction model is only slightly above the Kernel Density, except in the highest risk category, the Kernel Density model is a stronger predictor.
##Conclusion
We would not recommend this algorithm be put into production. We would not recommend for any predictive policing models to be used by public or private entities. In addition to selection bias discussed in the introduction, predictive policing models use past data to predict future outcomes, yet when basing the future on a past, those biases become integrated into the model. Its the paradox of using data, too much data collapses in on itself and creates a loop. In terms of policing, using predictive models to allocate police to hotspots reinforces pervasive racist practices of police departments that have been present throughout history. If there are more police present, they will make more arrests, regardless of the actual rate of crimes.
Using new technology feels exciting but it is not always the best route. Predictive policing models to not model risks equitably or adhere to algorithmic justice. Technology based on algorithms takes out the human part of understanding societal problems and solutions. A better use of risk prediction would be climate risk insurance modeling or predicting fires for Fire Departments. Even when predicting morbidity in a population, the results are based on indicators from the past and with their own biases and subjectivity, an recent example being COVID-19.